Abstract

The design and analysis of an algorithm for the 
restoration of degraded images of machine-printed 
characters is presented. The input is a set of de- 
graded bilevel images of a single unknown charac- 
ter; the output is an approximation to the charac- 
ter’s ideal artwork. The algorithm seeks to minimize 
the discrepancy between the approximation and the 
ideal, measured as the worst-case Euclidean distance 
between their boundaries. We investigate a family 
of algorithms which superimpose the input images, 
add up the intensities at each point, and threshold 
the result. We show that, under degradations due 
to random spatial sampling error, significant asymp- 
totic improvements can be achieved by suitably pre- 
processing each input image and postprocessing the 
final result. Experimental trials on special test shapes 
and Latin characters are discussed. 

1 Introduction 

In the last few years, a variety of document-image 
degradation models have been proposed, and their 
applications investigated [3]. Models of this sort 
are often instantiated as a software image generator 
which reads a single ideal prototype image (of, say, 
a machine-printed symbol), and, as directed by the 
model, writes an arbitrarily large number of pseudo- 
randomly degraded images. Here, we explore some 
implications of inverting this procedure by designing 
an image restorer which reads a set of degraded im- 
ages and attempts to recover, or closely approximate, 
the ideal artwork from which they were derived. 

This procedure is a special case of classical image 
restoration [12], of which image deconvolution is an- 
other, well-studied special case. The problem is sim- 
ilar in some respects to super-resolution surface re- 
construction from multiple images [6] and sub-pixel 
edge location in grey-level images [1]. Our problem
domain is distinguished from many that are treated
in this literature by several factors: (1) we read many
input images (not merely one); (2) the input images
are bilevel (not grey or color); (3) the ideal image to
be recovered is also bilevel and at a much higher (e.g.
×10) spatial sampling rate than the input; (4) we
do not, in general, know the parameters of the image 
degradation model; and, (5) we may know some con- 
straints on the class of images, such as the frequent 
occurrence along their boundaries of straight lines, 
sharp corners, and curves of large radius. For these 
reasons we foresee advantages in algorithms specially 
adapted to the problem domain. 

We anticipate several applications of such an al- 
gorithm. One is speeding up the adaptation of an 
OCR system to a given document, by a procedure of 
the following sort: for each of the most commonly 
confused symbols, a few images are lifted and their 
ideal prototype is inferred and input to the generator, 
which can write a much larger training set than can 
be collected from the document. Another application 
is calibration of a degradation model (estimation of 
its parameters) to a document for which ideal proto- 
types are unknown. 

We have investigated a family of algorithms which 
superimpose the input images, add up the intensities 
at each point, and threshold the result. Methods of 
this sort have been occasionally reported in the OCR 
literature [11], but we are not aware of any attempt 
to analyze their asymptotic performance or improve 
them by special pre- and post-processing. 

There are several generally related papers in the lit- 
erature on document image analysis which are worth 
mentioning although they neither attack the same 
problem nor use the same method. Billawa, Hart, and 
Peairs [5] studied the restoration of repeated defects 
at known locations in an image (e.g. scratches on a 
copier platen). Shin et al. [13] explored similar
methods for the purpose of contrast enhancement.
2 The Algorithm

The basic idea of the Image Averaging algorithm is to 
superimpose the input images, add up the intensities 
at each point, and threshold the result to obtain a 
new binary image. A broad class of algorithms can 
be expressed in this form by suitably preprocessing 
each input image and postprocessing the final output 
as suggested by Figure 1. 

After the input images have been preprocessed, 
they can be superimposed by simply finding the cen- 
troid of each input image to subpixel accuracy and 
shifting the images so that the centroids coincide. It is 
also possible to use a feedback process to try to read- 
just the relative positions once a tentative consensus 
image is available, as Hastie et al. do [7], but this
involves considerable complexity and does not avoid 
the theoretical limitations discussed in Section 4. 

The superimposed inputs can be converted to a binary
image by simple thresholding or by an edge detection
algorithm such as that of Avrahami and Pratt [1].
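The superimpose, add up and threshold idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed square canvas, and the rounding of the centroid shift to whole pixels are assumptions made for brevity (the text calls for subpixel alignment).

```python
def centroid(img):
    """Centroid (row, col) of the black (nonzero) pixels of a bilevel image."""
    pts = [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]
    n = len(pts)
    return sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n


def naive_average(images, threshold=0.5, size=20):
    """Align images by centroid, add them up, and threshold the sum.

    Assumption: shifts are rounded to whole pixels and the canvas
    (size x size) is large enough to hold every shifted image.
    """
    total = [[0] * size for _ in range(size)]
    mid = size // 2
    for img in images:
        r0, c0 = centroid(img)
        dr, dc = round(mid - r0), round(mid - c0)
        for r, row in enumerate(img):
            for c, v in enumerate(row):
                rr, cc = r + dr, c + dc
                if v and 0 <= rr < size and 0 <= cc < size:
                    total[rr][cc] += 1
    need = threshold * len(images)
    return [[1 if v >= need else 0 for v in row] for row in total]
```

Passing a list of bilevel images (lists of 0/1 rows) returns a single bilevel consensus image; a pixel is black when at least the given fraction of the aligned inputs are black there.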

As the picture suggests, the most important func- 
tion of the postprocessing step is to smooth out the 
high-frequency “wobbles” in the outlines produced 
by the thresholding process. Adding up input im- 
ages and thresholding them reduces the magnitude 
of this noise so that the wobbles are more readily dis- 
tinguished from desired features such as the serifs on 
the letter “A” in Figure 1. The smoothing algorithm 
needs to operate on polygonal outlines and produce 
output that fits the input as closely as possible while 
smoothing out wobbles up to some specified magni- 
tude. There should be no attempt to minimize the 
number of vertices in the output at the expense of 
quality of fit. 

A good choice of smoothing algorithm is Hobby’s 
algorithm [8]. This algorithm minimizes the num- 
ber of inflections in the resulting polygonal outlines 
subject to a bound on the deviation. It produces a 
description of a class of outlines that obey these cri- 
teria so that the output can be chosen to fit the input 
as well as possible. The post-processing step consists 
of choosing the maximum deviation parameter as a 
yet-to-be determined function of the number of input 
images and running Hobby’s algorithm. 

The other essential component of the Image Aver- 
aging algorithm is the preprocessing step. One option 
is to omit the preprocessing and just treat the inputs 
as binary images where the pixels are unit squares. 
This gives the Naive Averaging algorithm. 

Alternatively, Hobby’s polygonal smoothing algorithm
can be applied to the input images as part of the
preprocessing step, but this must be done carefully
to avoid destroying significant features. We considered
various preprocessing options, some involving the
smoothing algorithm and some not. The options were
evaluated both empirically and according to their
theoretical performance as the number of input images
approaches infinity. The details appear in later
sections; we first concentrate on presenting the most
successful preprocessing strategies.

[Figure 1 diagram; stage labels: superimpose, add up and
threshold; postprocess (smooth and sharpen corners)]

Figure 1: An example of how the algorithm performs
with simulated input images as might be obtained
from a 7 point font at approximately 300 dots/inch.
Only the first two of 100 input images are shown here.

2.1 Preprocessing by Smoothing the 
Outlines 

One approach to preprocessing is to take each ras- 
terized input image and try to guess the underlying 
shape that generated it. This can be thought of as 
“inverse rasterization,” and the algorithm of [8] can 
do it as shown in Figure 2. 


(a) (b)

Figure 2: (a) A comparison of the original outlines
(thin lines) with the smoothed versions (thick
lines) for one of the input images from Figure 1; (b) the
smoothed outlines filled in.

The smoothing algorithm has an error tolerance pa- 
rameter e that must be set to at most | pixel in order 
to guarantee that a simple rasterization process could 
regenerate the original input outlines. In fact, it may 
be better to choose a value like e = g in order to avoid 
smoothing out features that may be significant. Us- 
ing this preprocessing strategy for Image Averaging 
yields the Presmoothing algorithm. 

2.2 Adding up the Inputs and Thresh- 
olding 

Once we have polygonal outlines for each of the input 
images, how do we add them up? One way would be 
to build a data structure that divides the plane into 
regions according to the number of input images that 
overlap at each point. This would require maintain- 
ing a potentially very large number of polygonal re- 
gions and figuring out how to update them given the 
outlines that describe a new input image. 

A more practical approach is first to rasterize 
each polygonal outline using any reasonable scan- 


conversion algorithm, and then add up the rasterized 
images. This rasterization should be done at a res- 
olution substantially higher than that of the input 
images since the resolution limits the precision of the 
final output. It is best to use a run-length represen- 
tation so that the run time and space requirements 
will not be quadratic in the resolution. 

Scale each input by some factor σ, and assume that
the scan-conversion algorithm finds all triplets of
integers (x, y, d) such that the scaled outlines cross the
segment from (x, y) to (x + 1, y), where d = ±1 depending
on whether the crossing is in the downward or upward
direction. After all the inputs have been rasterized
in this fashion, we can collect all the triplets with
a given y-value and sort them according to x. Enforcing
a threshold t involves scanning the sorted list,
maintaining the cumulative total of the d values and
saving only those triplets that raise the total from t − 1
to t or lower it from t to t − 1. Knuth has published
detailed implementations of all these algorithms [9,
Parts 19, 20, 22].
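The thresholding scan just described can be sketched as follows. This is an illustration only, not Knuth's implementation: the function name is hypothetical, and the sign convention (d = +1 where an outline is entered, d = −1 where it is left) is an assumption.

```python
from collections import defaultdict


def threshold_scanlines(triples, t):
    """Keep only the crossings where the overlap count reaches or leaves t.

    triples: iterable of (x, y, d) with d = +1 or -1 (assumed convention:
    +1 enters an outline, -1 leaves it).
    """
    by_row = defaultdict(list)
    for x, y, d in triples:
        by_row[y].append((x, d))
    kept = []
    for y in sorted(by_row):
        total = 0  # cumulative sum of d along this scan line
        for x, d in sorted(by_row[y]):
            new_total = total + d
            rises = total == t - 1 and new_total == t
            falls = total == t and new_total == t - 1
            if rises or falls:
                kept.append((x, y, d))
            total = new_total
    return kept
```

With three nested spans on one scan line and t = 2, only the crossings of the middle span survive, which is the outline of the region covered by at least two inputs.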

2.3 Preprocessing via Smooth-Shaded 
Outlines 

Rather than trying to guess the underlying shape 
from looking at an input image, a better preprocess- 
ing strategy might be to try to express the range of 
possibilities. In other words, the preprocessed image 
should have fuzzy edges that simulate the result of 
averaging all the possible underlying images. This 
can be done by taking a closer look at the output 
of Hobby’s smoothing algorithm [8]. In addition to 
the polygonal approximation indicated by the dashed 
line in Figure 3, there is a sequence of trapezoids that 
the dashed line passes through. Instead of using the 
dashed line as a black-white boundary, each trapezoid 
can have smoothly varying gray levels such that the 
dashed line is 50% dark and the darkness reaches 0% 
and 100% at the parallel segments of the trapezoid 
boundary (thick lines in the figure). Image Averag- 
ing using this preprocessing strategy with tolerance 
e = g gives the Smooth-Shading algorithm. 

In order to make use of images with smooth-shaded 
edges as described above, we need to generalize the 
idea of rasterizing the images at a higher resolution 
and adding up the rasterizations. Consider a hori- 
zontal scan line passing through such an image. The 
darkness is a piecewise-linear function of the x co- 
ordinate along the line and the slope discontinuities 
occur at points where the scan line crosses trapezoid 
boundaries as shown in Figure 4. 

We can rasterize the trapezoids to get (x, y, d)
triples as before, but the d values should represent
changes in the slope of the darkness function, not
changes in the darkness itself. After rasterizing the
trapezoids from all the input images, it is a simple
matter to recreate the total darkness function from
the assorted (x, y, d) triples on a particular scan line.
The main difficulties are computing the correct d values
and preventing accumulated rounding errors from
getting too large. This gives Algorithm 1. (See also
Figure 5.)

Figure 3: (a) An outline extracted from an input image;
(b) the results of Hobby’s smoothing algorithm
with tolerance ε. The smoothed polygonal outline
is the dashed line and the other lines delimit
trapezoids that define a range of alternative outlines.

Figure 4: A section of a smooth-shaded image and
the trapezoids that defined it. The graph shows the
relative darkness along the dashed scan line.



Figure 5: A typical trapezoid for Algorithm 1 and
one of the scan lines $y = \bar{y}$ passing through it. The
$y = \bar{y}$ line is shown dashed.

In order to avoid accumulating errors in ΔD in
Step 8 of the algorithm, the third component of each
triple in (1) should be stored as a fixed-point number.
There can still be accumulated error in the total
darkness D, but this is partially alleviated by properties
of the trapezoids from [8]: trapezoids with a
large horizontal extent tend to have $Y_{iL} = Y_{i-1,L}$ and
$Y_{iR} = Y_{i-1,R}$ so that $d' = d''$ in (1). (The bottommost
trapezoid in Figure 3b is an example of this.)

3 Experimental Results 

We have experimented with the three special test 
shapes shown in Figure 6 and two Latin characters: 
the Helvetica capital ‘R’ and Times Roman capital
‘R’. All trials were run on input images pseudoran- 
domly generated by a program implementing the pa- 
rameterized image defect model described in [2]. The 
degraded images were at a nominal text size of 8 
point, and imaged at a spatial sampling rate of 300 
pixel/inch. The discrepancies between approximated 
and ideal boundaries are expressed in units of in- 
put pixel-width. In order to estimate the algorithm’s 
asymptotic performance, randomized trials on 250 in- 
put images were repeated 25 times and their means 
and standard errors computed. 
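The abstract describes the discrepancy as the worst-case Euclidean distance between the approximated and ideal boundaries. One plausible reading, assumed here and not confirmed by the text, is a symmetric Hausdorff distance between points sampled along the two boundaries. The function names below are hypothetical.

```python
import math


def directed_distance(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def boundary_discrepancy(A, B):
    """Symmetric worst-case distance between two sampled boundaries.

    A and B are lists of (x, y) points; coordinates are assumed to be
    in units of input pixel-width, as in the reported results.
    """
    return max(directed_distance(A, B), directed_distance(B, A))
```

This brute-force version is quadratic in the number of sample points, which is adequate for single-character outlines.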

Table 1 reports the results of experiments on images
degraded by uniformly randomized spatial sampling
error. It shows how the three algorithms performed
on the test shapes and on the Helvetica and
Times-Roman ‘R’s. The smooth-shading algorithm
did significantly better than naive averaging and
presmoothing, and the test shapes were significantly
easier than the ‘R’s.


Algorithm 1 How to do the “add up and threshold”
step given a set of input outlines, a scale factor σ and
a threshold T.

1. Initialize j ← 1.

2. Apply Hobby’s smoothing algorithm [8] to
the j-th outline, obtaining $P_{iL} = (X_{iL}, Y_{iL})$ and
$P_{iR} = (X_{iR}, Y_{iR})$ for $i = 0, 1, 2, \ldots, n_j$.

3. Let $\bar{P}_{iL} = \sigma P_{iL}$ and $\bar{P}_{iR} = \sigma P_{iR}$ for each i. Then
initialize i ← 1.

4. Find all integers $\bar{y}$ such that the trapezoid
$\bar{P}_{i-1,R}\bar{P}_{iR}\bar{P}_{iL}\bar{P}_{i-1,L}$ intersects the line $y = \bar{y}$ and
execute Step 5 for each pair of intersection points
$(x', \bar{y})$ and $(x'', \bar{y})$.

5. Compute darkness values $d'$ and $d''$ for $(x', \bar{y})$
and $(x'', \bar{y})$ by letting the darkness be 1 at $\bar{P}_{i-1,L}$
and $\bar{P}_{iL}$, 0 at $\bar{P}_{i-1,R}$ and $\bar{P}_{iR}$, and varying
linearly in between. Then store triples

\[
\left(x',\ \bar{y},\ \frac{d'' - d'}{|x'' - x'|}\right)
\quad\text{and}\quad
\left(x'',\ \bar{y},\ \frac{d' - d''}{|x'' - x'|}\right)
\tag{1}
\]

6. If $i < n_j$, increment i and go back to Step 4.

7. If there are more outlines, increment j and go
back to Step 2.

8. For each $\bar{y}$ such that there are triples with $y = \bar{y}$,
sort the triples by x and scan them in order,
maintaining total darkness D and rate of change
of darkness ΔD. Each time the total darkness
crosses the threshold T, output the point $(x, \bar{y})$
where this happens.
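Step 8 can be sketched for a single scan line as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, the darkness is taken to be 0 with slope 0 to the left of all triples, and floating point is used where the text recommends fixed-point storage.

```python
def scan_threshold(triples, T):
    """Rebuild a piecewise-linear darkness function and find threshold crossings.

    triples: list of (x, slope_change) pairs for one scan line, i.e. the
    first and third components of the stored triples.
    Returns the x coordinates where the darkness crosses T.
    """
    crossings = []
    D = 0.0        # total darkness at the current position
    slope = 0.0    # rate of change of darkness (Delta D)
    prev_x = None
    for x, ds in sorted(triples):
        if prev_x is not None and x > prev_x:
            D_next = D + slope * (x - prev_x)
            # A crossing occurs when "at or above T" changes between breakpoints.
            if (D >= T) != (D_next >= T):
                crossings.append(prev_x + (T - D) / slope)
            D = D_next
        slope += ds
        prev_x = x
    return crossings
```

For a single shape whose darkness ramps from 0 to 1 between x = 1 and x = 3 and back down between x = 5 and x = 7, the triples are (1, +0.5), (3, −0.5), (5, −0.5), (7, +0.5), and a threshold of 0.5 places the edges at the midpoints of the two ramps.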



Test shape 1 Test shape 2 Test shape 3 


Figure 6: Three test shapes that were used in the 
experiments. 


input shape   smooth shading   naive averaging   presmoothing
Test 1        0.08 ± 0.01      0.11 ± 0.02       0.16 ± 0.03
Test 2        0.12 ± 0.02      0.18 ± 0.06       0.18 ± 0.01
Test 3        0.12 ± 0.01      0.16 ± 0.03       0.20 ± 0.04
Times R       0.47 ± 0.05      0.53 ± 0.05       0.51 ± 0.10
Helv. R       0.63 ± 0.04      0.68 ± 0.12       0.66 ± 0.11

Table 1: Error in recovering the ideal shape for various
algorithms and input shapes under uniform spatial
sampling error. Entries of the form μ ± σ mean
that the mean is μ and the standard error is σ.

The test runs used Algorithm 2 from Section 4 to 
deal with the sharp corners in the test shapes. This 
performed as well as the analysis leads us to expect 
on the three test shapes, but less well on the ‘R’s. 

We also experimented with more complex docu- 
ment-image degradations, which have not yet yielded 
to analysis. Table 2 shows the effect of combining 
these degradations with uniform spatial sampling er- 
ror. The degradation labeled S in the table is skew 
(rotation) varying normally with mean 0.0 and stan- 
dard error 4.0 (degrees). For each image, this skew 
angle was passed to the image averaging algorithm, 
which attempted to correct for it; our motivation for 
this policy is that skew can often be estimated accu- 
rately from the complete page image. 


              Defect model
input shape   U             U + S         U + B
Test 1        0.08 ± 0.01   0.16 ± 0.01   0.19 ± 0.02
Test 2        0.12 ± 0.02   0.32 ± 0.04   0.26 ± 0.03
Test 3        0.12 ± 0.01   0.23 ± 0.05   0.31 ± 0.05
Times R       0.47 ± 0.05   0.66 ± 0.04   0.70 ± 0.02
Helv. R       0.63 ± 0.04   0.67 ± 0.05   1.44 ± 0.06

Table 2: Error in recovering the ideal shape for the
smooth-shading algorithm with various input shapes
and image defect models. In the column labels, U
refers to uniform spatial sampling error, S refers to
known random skew, and B refers to randomized
blurring and thresholding. Entries of the form μ ± σ
mean that the mean is μ and the standard error is σ.

The degradation labeled B in Table 2 is blurring
and thresholding: the blurring is a circularly symmetric
Gaussian kernel whose standard error varies
normally with mean 0.5 and standard error 0.5 (units of
input pixels); and the threshold varies normally with
mean 0.5 and standard error 0.125 (intensity). This
did not increase the errors much relative to the results
for the variable-skew trials, with one striking exception:
on the Times-Roman ‘R’, the error was much
larger (mean 1.50, σ 0.06); on inspection, this was
clearly due to blunting of the tips of sharp serifs.

4 The Sharp-Corner Problem 

The algorithms tested in Section 3 tend to produce 
poor results for shapes that have sharp corners unless 
the thresholding process is modified or a special post- 
processing step is used. Before explaining the reme- 
dies to this problem, we need to know what causes 
this effect and how its magnitude depends on the
preprocessing strategy. 

Figure 7 shows what happens near a 90° corner 
with the Naive Averaging algorithm. Each input im- 
age is generated by sampling the test shape on a ran- 
domly shifted grid. Adding these images and thresh- 
olding them produces a rounded corner as shown by 
the dark shaded region in the figure. (Like all the fig- 
ures in this section, Figure 7 is based on a coordinate 
system with positive y oriented upward.) 



Figure 7: A magnified portion of the lower-right cor- 
ner on Test shape 2. The dark-shaded region is the 
result of adding-up and thresholding with no prepro- 
cessing. It is superimposed on the original test shape 
(lightly shaded). The dashed square shows the size 
of one input pixel. 

The Image Averaging algorithm attempts to align 
the resulting input images by shifting them so that 
their centroids coincide. If we make the optimistic 
assumption that this process correctly compensates 
for the random shift in the sample grid, we have the 
perfect positioning assumption. Under this assump- 
tion, the process of generating input images followed 
by Naive Averaging is equivalent to the following: 

1. Select n random points $P_1, P_2, \ldots, P_n$ in the unit
square and let $P_i + \mathbb{Z}^2$ be the set of grid points
obtained by adding integers to the x and y components
of $P_i$.

2. For each (x, y) and each i ≤ n, the square

\[
\left[x - \tfrac12,\ x + \tfrac12\right] \times \left[y - \tfrac12,\ y + \tfrac12\right]
\tag{2}
\]

contains a unique point from $P_i + \mathbb{Z}^2$. There are
n such points for each (x, y). Color (x, y) black
if at least n/2 of these are in the test shape.

This rule for deciding the color of (x, y) amounts to
looking at a crude estimate of the area of the intersection
of the square (2) and the test shape. Since the
expected error in this estimate is $O(1/\sqrt{n})$, the limiting
behavior for large n is to include those points (x, y)
for which at least half of the square (2) is in the test
shape. This defines a function $F_{NA}$ that maps a test
shape into the expected result of Naive Averaging.

Since any line through the center of a square divides
it into two equal pieces, $F_{NA}$ maps any half plane into
itself. Now consider the 90° wedge W defined by

\[
x \le 0, \quad y \ge 0
\tag{3}
\]

as shown in Figure 8a. If $x \le -\tfrac12$, the intersection
of (2) and W has area $\ge \tfrac12$ when $y \ge 0$. If $y \ge \tfrac12$,
the intersection is at least 50% covered when $x \le 0$.
If $x \ge \tfrac12$ or $y \le -\tfrac12$, the intersection area is always
0. Hence $F_{NA}(W)$ must equal W except possibly for
points (x, y) in a unit square centered on the origin.
(This is the dashed square in Figure 8a.)

If (x, y) is in the dashed square, the intersection of
(2) with W has area

\[
\left(\tfrac12 - x\right)\left(\tfrac12 + y\right).
\tag{4}
\]

Setting this equal to $\tfrac12$ gives a hyperbola that passes
through $(-\tfrac12, 0)$ and $(0, \tfrac12)$. This is the curved boundary
of the dark shaded region in Figure 8b. Note
how the experimental result in Figure 7 agrees with
the computed shape shown in Figure 8b.
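The limiting behavior near the 90° wedge can be checked numerically with a short sketch. The code assumes the wedge is x ≤ 0, y ≥ 0, that the sample point falling in each unit square is uniformly distributed over it, and that the area inside the dashed square is (1/2 − x)(1/2 + y); the function names are hypothetical.

```python
import random


def wedge_area(x, y):
    """Area of the unit square centred at (x, y) inside the wedge x <= 0, y >= 0."""
    width = min(max(0.5 - x, 0.0), 1.0)    # part of [x-1/2, x+1/2] with x' <= 0
    height = min(max(y + 0.5, 0.0), 1.0)   # part of [y-1/2, y+1/2] with y' >= 0
    return width * height


def simulate(x, y, n, rng):
    """Fraction of n random sample points of the square that fall in the wedge."""
    hits = 0
    for _ in range(n):
        px = x + rng.random() - 0.5
        py = y + rng.random() - 0.5
        if px <= 0 and py >= 0:
            hits += 1
    return hits / n


def boundary_y(x):
    """y on the limiting boundary for -1/2 <= x <= 0, where the area equals 1/2."""
    return 0.5 / (0.5 - x) - 0.5
```

Calling `simulate` with increasing n gives estimates that settle toward `wedge_area`, and `boundary_y` traces the hyperbola between (−1/2, 0) and (0, 1/2).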

The point is that the expected asymptotic result
$F_{NA}(W)$ deviates from W by a certain fraction of
an input pixel and this distance depends only on the
shape of W. The following theorem formalizes the
idea that the result of Naive Averaging on W approaches
$F_{NA}(W)$ as the number of input images approaches
infinity.

Theorem 4.1 For any region W and any probability
p, there exist families of regions $W^-$ and $W^+$ that
satisfy $W^- \subseteq F_{NA}(W) \subseteq W^+$, approach $F_{NA}(W)$
as n approaches ∞, and have the following properties:
for any point $P^+ \notin W^+$, generating n input
images by sampling W with randomly shifted grids
and doing Naive Averaging under the perfect positioning
assumption produces a result $W_{NA}$ where the
probability that $P^+ \in W_{NA}$ is at most p; similarly,
any $P^- \in W^-$ is outside of $W_{NA}$ with probability at
most p.

(a) (b)

Figure 8: (a) A 90° wedge W and a region (dashed)
where $F_{NA}(W)$ might disagree with W; (b) The
expected result $F_{NA}(W)$ of Naive Averaging (dark
shaded) and the difference $W \setminus F_{NA}(W)$ (lightly
shaded). Wedge W extends to infinity leftward and
upward.

Proof. We define $W^-$ and $W^+$ in terms of
yet-to-be-chosen parameters $\alpha^-$ and $\alpha^+$. Let $W^-$ be
the set of all (x, y) for which the intersection of W
with the square (2) has area at least $\alpha^-$; region $W^+$
is similar except the area bound is $\alpha^+$. This gives
$W^- \subseteq F_{NA}(W) \subseteq W^+$ whenever $\alpha^- > \tfrac12 > \alpha^+$.
Furthermore, $W^-$ and $W^+$ approach $F_{NA}(W)$ if $\alpha^-$
and $\alpha^+$ approach $\tfrac12$ as n approaches ∞.

It remains to show that $\alpha^+$ and $\alpha^-$ can be chosen
so that $P^+ \in W_{NA}$ and $P^- \notin W_{NA}$ with arbitrarily
small probability. At $(x, y) = P^+$, the fraction α of
square (2) within W is at most $\alpha^+$, and is $< \tfrac12$.
The probability that W contains at least n/2 of the
n sample points in square (2) is

\[
\sum_{k \ge n/2} \binom{n}{k} \alpha^k (1-\alpha)^{n-k}
< 2^n \sum_{k \ge n/2} \alpha^k (1-\alpha)^{n-k}
\le \frac{2^n \alpha^{\lceil n/2 \rceil} (1-\alpha)^{\lfloor n/2 \rfloor}}{1 - \frac{\alpha}{1-\alpha}}
\tag{5}
\]

\[
\le 2^n \alpha^{n/2} (1-\alpha)^{n/2}\, \frac{1-\alpha}{1-2\alpha}
\le \frac{(4\alpha - 4\alpha^2)^{n/2}}{1-2\alpha}.
\tag{6}
\]

For any fixed $\alpha < \tfrac12$ this approaches zero as n
approaches ∞. Hence solving

\[
\frac{\left(4\alpha^+ - 4(\alpha^+)^2\right)^{n/2}}{1 - 2\alpha^+} = p
\]

for $\alpha^+$ yields a function that approaches $\tfrac12$ as n
approaches ∞. Since (6) bounds the probability that
$P^+ \in W_{NA}$, this satisfies the theorem.

The probability that $P^- \notin W_{NA}$ is the probability
that more than n/2 of the sample points in square
(2) are outside of W when (x, y) is such that at most
$1 - \alpha^-$ of the square is outside W. Choosing $\alpha^- =
1 - \alpha^+$ makes this probability at most p as required.
□

4.1 The Asymptotic Behavior of Presmoothing
and Smooth Shading

We have analyzed Naive Averaging under the perfect
positioning assumption with only spatial sampling
error. Even under these favorable conditions,
it approaches something different from the desired
shape as the number of input images approaches
infinity. Do the Presmoothing and Smooth Shading
algorithms suffer from similar defects?

First, consider the Presmoothing algorithm. It is
harder to analyze than Naive Averaging because the
polygonal smoothing algorithm expands the area
affected by each sample point. In the simple case
of the 90° wedge (3), the analysis is easy because
the smoothing algorithm leaves such 90° corners
unchanged. Hence, the previous analysis holds and the
expected result as the number of input images n
approaches infinity deviates from the 90° wedge as
shown in Figure 8.

Now consider the Smooth Shading algorithm for
the 90° wedge W given by (3) under the perfect
positioning assumption. The preprocessed image differs
from that of the Naive Averaging algorithm in that the
sharp edges shown in Figure 9a are replaced by gradual
shading from black to white over some distance 2ε
as shown in Figure 9b. (The recommended value for
ε is given in Section 2.3.)


[Figure 9, panels (a) and (b); the shading width 2ε is labeled.]

Figure 9: (a) A 90° wedge; (b) the effect of the
preprocessing used by the Smooth Shading algorithm. The
dashed line shows the corresponding position of the
edge of the 90° wedge.

For Naive Averaging, the expected darkness at a
point (x, y) before the thresholding step was the area
of the 90° wedge W inside the unit square (2)
centered at (x, y). For Smooth Shading, the corresponding
rule is similar except that the 90° wedge is replaced
by the smooth-shaded version in Figure 9b and the
area becomes the integral of the shading density over
the square.

For example if (x,y) = (— jb, |) and e = g, the 
shading density functions for (2) are as in Figure 10. 
There are three subregions where the shading density 
is nonzero, and a different shading density function 
applies within each subregion as indicated in the fig- 
ure. For the subregion labeled , the integral is 



\[
\frac{15}{104} + \frac{19}{1248} - \frac{1}{48} = \frac{173}{1248}
\]


Similar computations show that the integral of 1 ~ 3r 
over the indicated region is || and the integral of 1 
over the indicated rectangle is ^ This gives a total 
of J 248 + ff + = g , indicating that thresholding 

at | places ( x,y ) = (— dg, g) is on the boundary. 


(_I 5) (l 

v 3 ’ 6 ' 6J 



Figure 10: Shading density functions for the square 
(2) with (x,y) = (-dg,§) and e = 

Since (— yg, |) is not on the boundary of W, 
the expected result of Smooth Shading on the 90° 
wedge W differs from W by a constant amount, 
even as the number of inputs approaches infinity. It 
does not seem worthwhile to try to derive expres- 
sions that define the boundary of the expected re- 
sult W' of Smooth Shading on W under the perfect 
positioning assumption with only spatial sampling
error, but the authors’ numerical experiments indicate
that the closest approach to (0, 0) is approximately
(−0.181, 0.181) and that the total area of $W \setminus W'$ is
approximately 0.091.

4.2 Restoring Sharp Corners 

The function $F_{NA}(W)$ that models the effects of spatial
sampling error followed by Naive Averaging is
equivalent to blurring and then thresholding. Specifically,
the blurring function is to convolve with a unit
square of uniform density. This suggests replacing the
thresholding step with deblurring followed by thresholding.
The motion deblurring algorithm of Lee and
Vardi is appropriate for this application [10, 14].

Alternatively, we can try to invert the function $F_{NA}$
or whatever function is appropriate for the form of
Image Averaging that is being used. This pseudo-inverse
can be applied after smoothing the outlines
extracted from the thresholded image as a final
postprocessing step. This strategy was adopted for the
following reasons:

1. Smooth Shading and Presmoothing are hard to 
model with blurring functions, and these strate- 
gies performed better than Naive Averaging in 
our tests. 

2. Dropping the perfect positioning assumption and 
allowing error sources other than spatial sam- 
pling error makes it hard to know what blurring 
kernel to use. 

3. Post-processing the final outlines requires signif- 
icantly less space and run time than deblurring 
would. 

As the above points imply, correcting for the sharp
corner problem is necessarily a somewhat heuristic
procedure. We just need to start with something that
has parameters to choose and does roughly the opposite
of what $F_{NA}$ does. In order to get a better
understanding of $F_{NA}$, consider the region W that
satisfies

\[
y \ge m_1 x, \quad y \ge m_2 x, \quad \text{where } -1 < m_1 < m_2 < 1
\]

as shown in Figure 11. We need an expression for
the area of the square (2) within W when $|x| \le \tfrac12$
and (x, y) is such that W contains $(x \pm \tfrac12,\ y + \tfrac12)$ but
neither of the points $(x \pm \tfrac12,\ y - \tfrac12)$.

Figure 11: The region W for the derivation of $F_{NA}$
with the square (2) dashed. In this case $m_1 = 0.15$
and $m_2 = 0.85$. Note that W extends to infinity
leftward and upward.
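This area is

\[
\int_{x-1/2}^{0} \left(y + \tfrac12 - m_1 t\right) dt
+ \int_{0}^{x+1/2} \left(y + \tfrac12 - m_2 t\right) dt
= y + \tfrac12 + \frac{m_1 \left(x - \tfrac12\right)^2}{2} - \frac{m_2 \left(x + \tfrac12\right)^2}{2}.
\]

Setting this equal to $\tfrac12$ gives the boundary of $F_{NA}(W)$:

\[
y = \frac{m_2 \left(x + \tfrac12\right)^2}{2} - \frac{m_1 \left(x - \tfrac12\right)^2}{2}
= \frac{m_2 - m_1}{2}\, x^2 + \frac{m_1 + m_2}{2}\, x + \frac{m_2 - m_1}{8}.
\tag{7}
\]

This hits the boundary of W at $(-\tfrac12, -\tfrac12 m_1)$ and
$(\tfrac12, \tfrac12 m_2)$ and passes through $(0, \tfrac{m_2 - m_1}{8})$.

Now consider the total area inside W but below
the boundary of $F_{NA}(W)$. This is the integral of (7)
on $-\tfrac12 \le x \le \tfrac12$ minus the integral of
$\max(m_1 x, m_2 x)$ over the same interval. Integrating (7) gives

\[
\frac{m_2 - m_1}{24} + 0 + \frac{m_2 - m_1}{8}
\]

and

\[
\int_{-1/2}^{1/2} \max(m_1 x, m_2 x)\, dx
= \int_{-1/2}^{0} m_1 x\, dx + \int_{0}^{1/2} m_2 x\, dx
= -\frac{m_1}{8} + \frac{m_2}{8}.
\]

Thus the area of $W \setminus F_{NA}(W)$ is $\frac{m_2 - m_1}{24}$.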
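The area lost at the corner can be checked numerically. The sketch below assumes that the boundary in (7) is $y = \left[m_2 (x + \tfrac12)^2 - m_1 (x - \tfrac12)^2\right]/2$ for $|x| \le \tfrac12$ and that the expected area is $(m_2 - m_1)/24$; the function names and the midpoint-rule integration are illustrative choices.

```python
def boundary(x, m1, m2):
    """Assumed form of equation (7): limiting boundary of F_NA(W), |x| <= 1/2."""
    return (m2 * (x + 0.5) ** 2 - m1 * (x - 0.5) ** 2) / 2.0


def lost_area(m1, m2, steps=2000):
    """Midpoint-rule estimate of the area between (7) and the boundary of W."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = -0.5 + (k + 0.5) * h
        total += (boundary(x, m1, m2) - max(m1 * x, m2 * x)) * h
    return total
```

For the slopes used in Figure 11, `lost_area(0.15, 0.85)` comes out close to 0.7/24, or roughly 0.029 square input pixels.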

This suggests that a pseudo-inverse of $F_{NA}$
should add area near each convex corner of W and
subtract area near each concave corner. The amount
of area should be roughly proportional to the angle
at the corner. Algorithm 2 shows one way to perform
this operation on a polygonal outline. It adds γ
units of area per radian, where γ is a parameter to be
determined empirically. For small angles, the above
argument suggests additional area for an an- 
gle of approximately so that 7 « 

This ranges from ^ to ^ depending on mi and m 2 . 
We also found an area difference of 0.091 for Smooth 
Shading with an angle of This suggest 7 « 
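One pass of the edge-shifting idea behind Algorithm 2 (Steps 2-4 only) might be sketched as follows. Several things here are assumptions rather than the authors' method: the polygon is taken to be counterclockwise in a y-up coordinate system so that "right" is the exterior, the vertex angles are read as signed turning angles, adjacent edges are assumed non-parallel, and the edge-collapsing of Steps 5-6 is omitted. All function names are hypothetical.

```python
import math


def turning_angles(poly):
    """Signed exterior (turning) angle at each vertex of a closed polygon."""
    n = len(poly)
    angles = []
    for i in range(n):
        ax, ay = poly[i - 1]
        bx, by = poly[i]
        cx, cy = poly[(i + 1) % n]
        ux, uy = bx - ax, by - ay
        vx, vy = cx - bx, cy - by
        angles.append(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))
    return angles


def shift_edges_once(poly, gamma_step):
    """Shift every edge to its right by s_i and re-intersect adjacent edges."""
    n = len(poly)
    ang = turning_angles(poly)
    lines = []  # (point on the shifted line, edge direction)
    for i in range(n):
        (x0, y0), (x1, y1) = poly[i], poly[(i + 1) % n]
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        s = gamma_step * (ang[i] + ang[(i + 1) % n]) / (2.0 * length)
        rx, ry = dy / length, -dx / length   # unit normal to the right of the edge
        lines.append(((x0 + s * rx, y0 + s * ry), (dx, dy)))
    new_poly = []
    for i in range(n):
        (px, py), (dx, dy) = lines[i - 1]    # shifted edge arriving at vertex i
        (qx, qy), (ex, ey) = lines[i]        # shifted edge leaving vertex i
        det = dx * ey - dy * ex              # zero only for parallel edges
        t = ((qx - px) * ey - (qy - py) * ex) / det
        new_poly.append((px + t * dx, py + t * dy))
    return new_poly


def area(poly):
    """Signed area by the shoelace formula (positive for counterclockwise)."""
    return 0.5 * sum(x0 * y1 - x1 * y0
                     for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1]))
```

On a convex polygon the total turning is 2π, so a single small pass should add close to 2π times the per-pass area parameter, which gives a simple sanity check of the "area per radian" interpretation.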


5 An Application 

We applied an image averaging method (smooth 
shading) to a challenging OCR problem, a selection 
from which is shown in Figure 12. This consists often 
pages of text which appears to have been typewrit- 
ten then photo-offset: the image quality is variable 


Algorithm 2: An algorithm for postprocessing the
smoothed outlines from the thresholded image to
correct for the sharp corner problem by adding γ units
of area per radian.

1. Let γ' = γ/20 and repeat Steps 2-6 20 times.

2. For each polygon edge, compute a shift amount
s_i = γ'(θ_i + φ_i)/(2 l_i), where l_i is the length of
the edge and θ_i and φ_i are the angles at the
surrounding vertices.

3. For each edge, find the line ℓ_i obtained by
shifting the edge perpendicularly s_i units to its right.

4. For each vertex v, find the point v' where the ℓ_i
and ℓ_j for the edges incident on v intersect. This
forms a new polygon whose edges are parallel to
those of the old polygon.

5. For each edge of the new polygon whose orientation
is opposite that of the corresponding old
polygon edge, collapse the edge to a point where
the ℓ_i and ℓ_j for the surrounding edges intersect.

6. Repeat Step 5 until all the edge orientations
agree. Then update the old polygon to equal
the new polygon.
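The edge-shifting steps of Algorithm 2 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the polygon is taken to be counterclockwise (so "right" of an edge points outward), the corner angles are taken to be signed turning angles (positive at convex corners, negative at concave ones), and the collapse procedure of Steps 5-6 is omitted, so it is only suitable when no edge reverses orientation. All function names are hypothetical. Under these assumptions, shifting an edge of length l_i by s_i changes the area by about s_i l_i = γ'(θ_i + φ_i)/2, so each corner receives roughly γ' units of area per radian in each pass.

```python
import math

def turn_angle(p, q, r):
    """Signed turning angle at q along p -> q -> r (positive = left turn)."""
    ax, ay = q[0] - p[0], q[1] - p[1]
    bx, by = r[0] - q[0], r[1] - q[1]
    return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

def polygon_area(pts):
    """Signed area by the shoelace formula (positive if counterclockwise)."""
    n = len(pts)
    return 0.5 * sum(pts[i][0] * pts[(i + 1) % n][1]
                     - pts[(i + 1) % n][0] * pts[i][1] for i in range(n))

def intersect(p1, d1, p2, d2):
    """Intersection of the lines p1 + t*d1 and p2 + u*d2."""
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("adjacent edges are parallel")
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def offset_pass(pts, g):
    """Steps 2-4 once: shift every edge to its right and re-intersect."""
    n = len(pts)
    # Assumption: the corner angle is the signed turning angle at the vertex.
    angles = [turn_angle(pts[i - 1], pts[i], pts[(i + 1) % n])
              for i in range(n)]
    lines = []
    for i in range(n):
        p, q = pts[i], pts[(i + 1) % n]
        dx, dy = q[0] - p[0], q[1] - p[1]
        length = math.hypot(dx, dy)
        s = g * (angles[i] + angles[(i + 1) % n]) / (2 * length)  # Step 2
        nx, ny = dy / length, -dx / length  # unit normal to the edge's right
        lines.append(((p[0] + s * nx, p[1] + s * ny), (dx, dy)))  # Step 3
    # Step 4: new vertex i is where shifted edges i-1 and i intersect.
    return [intersect(*lines[i - 1], *lines[i]) for i in range(n)]

def sharpen_corners(pts, gamma, passes=20):
    """Step 1 plus the outer loop; Steps 5-6 (edge collapse) are omitted."""
    g = gamma / passes
    for _ in range(passes):
        pts = offset_pass(pts, g)
    return pts

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
grown = sharpen_corners(square, gamma=0.05)
print(polygon_area(grown) - polygon_area(square))
```

For the unit square the total turning is 2π, so the printed area gain should be close to 2πγ ≈ 0.314, plus a small second-order term from the corners themselves.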





and often low. The text is a collection of 50 puz- 
zle cryptograms for hobbyists; as a result, contextual 
constraints are unusually weak. The “body text” — 
setting aside the headers — contains 23684 characters 
including word-spaces. The single fixed-pitch type- 
writer face remains unidentified (we have not at- 
tempted to match it with the faces in our collection 
of typographer’s artwork). 

QAXBJ IASHC OQXEO SBXZO 
XIOSS FQUTC NBUHB HOJAT 
KEJAT ONESC FKFQJ EUHSO 

Figure 12: A magnified portion of the body text of a 
collection of puzzle cryptograms for hobbyists. The 
pages were originally typewritten, and then photo- 
offset. 

We processed these images in two phases. In the 
first phase, we ran our experimental page reader using 
a classifier trained on twenty fixed-pitch faces. This 
involved page layout, fixed-pitch processing, charac- 
ter classification, shape-directed resegmentation, and 
crude contextual filtering (i.e., each space-delimited
word was expected to be either all upper case alpha- 
betic or all numeric). We measured the “classifica- 
tion error” of this phase as follows. First, we semi- 
manually identified every pair of strings of characters 
(T, E) where T is from the ground truth and E is 
the string erroneously substituted for T by the page 
reader. T or E may include spaces and may be empty. 
Each (T, E) pair is locally minimal in that T and E 
share no character. We adopted the policy that each 
pair contributes to the error count a score equal to 
the maximum of the lengths of T and E, unless that 
number exceeds 10, in which case it is ignored (forced 
to zero), on the somewhat arbitrary assumption that 
the pair results from a layout, rather than a classifi- 
cation, mistake. Under this policy we counted 1706 
classification errors, for a nominal error rate of 7.2%
of the characters. 
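The scoring policy just described can be written down as a short Python sketch. The function names and the example (T, E) pairs are hypothetical and chosen only for illustration; the cutoff of 10 and the totals 1706 and 23684 are taken from the text.

```python
def pair_score(truth, ocr, layout_cutoff=10):
    """Score one (T, E) pair: the longer of the two strings, unless that
    exceeds the cutoff, in which case the pair is treated as a layout
    mistake and contributes nothing."""
    score = max(len(truth), len(ocr))
    return 0 if score > layout_cutoff else score

def classification_errors(pairs):
    """Total error count over all (T, E) pairs."""
    return sum(pair_score(t, e) for t, e in pairs)

def nominal_error_rate(errors, total_chars):
    """Errors as a fraction of the characters in the ground truth."""
    return errors / total_chars

# Hypothetical pairs: a substitution, a split, an inserted space, and a
# 12-character dropout that is ignored as a layout mistake.
pairs = [("O", "0"), ("I", "l1"), ("", " "), ("ABCDEFGHIJKL", "")]
print(classification_errors(pairs))                      # 1 + 2 + 1 + 0 = 4
print(round(100 * nominal_error_rate(1706, 23684), 1))   # 7.2
```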

In the second phase, we attempted to improve the 
error rate completely automatically by applying the 
image averaging algorithm, as follows. We sorted the 
character images from the first phase by their
top-choice character labels (a subset of the images
labeled “I” is shown in Figure 13). No attempt, automatic or
manual, was made to set aside erroneously labeled 
images. 

Then, for each character label, all of its images 
were given as input to the image averaging algorithm. 



Figure 13: These are 67 magnified images selected 
from the total of 1009 labeled “I”. Many, but not 
most, are mislabeled. Some are fragments of char- 
acters due to missegmentation, and some are pencil 
annotations. 

Each output of the image averaging algorithm (the
entire alphabet is shown in Figure 14) was then used
as an “ideal artwork” seed for input to a pseudoran- 
dom generator that wrote 150 degraded images (Fig- 
ure 15 shows examples for “I”). These were used to 
train a classifier which was used to read the pages 
a second time. The output was then scored, as de- 
scribed above. The result was a 20% reduction in 
error, to 1376 errors, or a nominal error rate of 5.8%.
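As a check on the reported figures, the arithmetic can be worked through directly from the counts given in the text (23684 body-text characters, 1706 errors in the first phase, 1376 in the second). The symbols e_1 and e_2 are labels introduced here for the two nominal error rates.

```latex
\begin{align*}
e_1 &= \frac{1706}{23684} \approx 0.0720, \\
e_2 &= \frac{1376}{23684} \approx 0.0581, \\
\frac{1706 - 1376}{1706} &= \frac{330}{1706} \approx 0.193 .
\end{align*}
```

The relative reduction is therefore about 19.3%, which the text rounds to 20%.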

0123456789 

ABCDEFGHI 

JKLMNOPQR 

STUVWXYZ 

Figure 14: The body text consisted only of upper case 
alphabetics and numerals. The “I” shown here should 
be compared with the images in Figure 13, a subset 
of those from which it was automatically inferred. 

We do not yet know how to characterize the cir- 
cumstances under which this strategy, of inferring 
prototypes from sets of possibly imperfectly labeled 
images, will be safe. All we can say at present is 
that, in this particular application, faced with an av- 
erage error rate of about 7%, it worked well. If a user 
were willing to oversee the method, a quick glance at 
the inferred prototypes should be sufficient to guard 
against disaster. 

Figure 15: For each letter of the alphabet, 150 images
were generated using the prototype artwork inferred
by the image averaging algorithm.

Another strategy for automatically retraining a
classifier on imperfectly labeled data has been
described in [4]. The present work departs from this in
(a) not using the real images directly; and (b) relying 
heavily on synthetic images from an image degrada- 
tion model. In future work, we hope to test whether 
these differences are relatively advantageous or
disadvantageous. It should be straightforward to pursue
both of these strategies simultaneously on the same 
document — potentially combining their advantages — 
simply by mixing the real and synthetic data before 
retraining. 

Perhaps, in the future, we can find ways to infer au- 
tomatically, not merely faithful prototypes for char- 
acter artwork, but the parameters of the document’s 
image degradation as well: this may allow even more 
effective automatic bootstrapping. 

6 Summary 

We have described an algorithm for image restoration 
which is specially adapted to applications in docu- 
ment image analysis. It accepts as input a set of 
bilevel images, and attempts to infer from them a 
single bilevel image at a higher spatial sampling rate. 
This output image should be similar to the ideal 
prototype of which the input images are degraded 
derivatives. Our basic approach follows prior art by 
“averaging” the images: the input images are super- 
imposed, intensities are added up, and the result is 
thresholded. We show that a naive implementation 
of this is inferior to an algorithm which specially pre- 
processes the input images and postprocesses the re- 
sult. The analysis concentrates on degradations due 
to uniform spatial sampling error. Experimental tri- 
als on a variety of degradations and symbol shapes are 
described. Results on artificial test shapes illustrate 
the advantages of the new algorithm, particularly in 
ameliorating rounding effects at sharp angles along
the boundary.
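For orientation, the naive "averaging" baseline summarized above (superimpose, add up intensities, threshold) might look like the following Python sketch. It is deliberately the simple variant, not the improved algorithm: it assumes the input images are already registered and of equal size, works at the input resolution rather than a higher sampling rate, and includes none of the preprocessing or postprocessing. The function name and the 0.5 default threshold are assumptions made for illustration.

```python
def average_and_threshold(images, threshold=0.5):
    """Superimpose equal-sized bilevel images (lists of rows of 0/1),
    add the intensities at each pixel, and threshold the sum at
    threshold * (number of images)."""
    if not images:
        raise ValueError("need at least one image")
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = sum(img[r][c] for img in images)  # summed intensity
            row.append(1 if total >= threshold * n else 0)
        out.append(row)
    return out

# Three hypothetical degraded images of a vertical stroke.
a = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
b = [[0, 1, 0], [1, 1, 0], [0, 1, 0]]
c = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
print(average_and_threshold([a, b, c]))
```

Pixels that are black in at least half of the inputs survive, so the isolated noise pixels in `a` and `b` and the dropout in `c` are all voted away.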

Future work should include analysis and experi- 
ments on a wider variety of symbol shapes and degra- 
dations, including images lifted from actual docu- 
ments. There also needs to be more work on the 
sharp corner problem, particularly the alternative 
mentioned in Section 4.2 involving deconvolution. Al- 
gorithm 2 is ad hoc and performed poorly on features 
such as the serifs on the Times Roman ‘R’. 
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